"When a measure becomes a target, it ceases to be a good measure." (Charles Goodhart / Marilyn Strathern, 1997)
Optionally: split the training data again for model tuning. To tune, fit models on the 'nested' training data and select the hyper-parameter configuration that optimises performance on the nested test set.
# 1. Pick a model
learn = lrn("regr.lm")
# 2. Load data
task = tsk("mtcars")
print(task)
## <TaskRegr:mtcars> (32 x 11)
## * Target: mpg
## * Properties: -
## * Features (10):
## - dbl (10): am, carb, cyl, disp, drat, gear, hp, qsec, vs, wt
task$head()
## mpg am carb cyl disp drat gear hp qsec vs wt
## 1: 21.0 1 4 6 160 3.90 4 110 16.46 0 2.620
## 2: 21.0 1 4 6 160 3.90 4 110 17.02 0 2.875
## 3: 22.8 1 1 4 108 3.85 4 93 18.61 1 2.320
## 4: 21.4 0 1 6 258 3.08 3 110 19.44 1 3.215
## 5: 18.7 0 2 8 360 3.15 3 175 17.02 0 3.440
## 6: 18.1 0 1 6 225 2.76 3 105 20.22 1 3.460
# 3. Split data with 2/3 hold-out split
train = sample(task$nrow, 2/3 * task$nrow)
test = setdiff(seq(task$nrow), train)
# 4. Fit model to data
learn$train(task, row_ids = train)
learn$model
##
## Call:
## stats::lm(formula = task$formula(), data = task$data())
##
## Coefficients:
## (Intercept) am carb cyl disp drat
## 15.51914 4.28928 -0.04438 0.40173 0.03062 -1.84221
## gear hp qsec vs wt
## 1.47485 -0.04911 1.06293 2.13296 -5.71573
# 5. Make predictions on new data
pred = learn$predict(task, row_ids = test)
print(pred)
## <PredictionRegr> for 11 observations:
## row_id truth response
## 3 22.8 27.57143
## 6 18.1 22.80786
## 8 24.4 22.74496
## ---
## 26 27.3 30.09445
## 28 30.4 30.56697
## 31 15.0 11.39905
# 6. Evaluate
pred$score()
## regr.mse
## 15.91255
Arguments against validation:
Arguments for validation:
gr = Graph$new()$
add_pipeop(po("learner", lrn("regr.svm")))$
add_pipeop(po("learner", lrn("regr.xgboost")))$
add_pipeop(po("learner", lrn("regr.ranger")))$
add_pipeop(po("regravg"))$
add_edge("regr.svm", "regravg")$
add_edge("regr.xgboost", "regravg")$
add_edge("regr.ranger", "regravg")
plot(gr, html = TRUE)
learn_avg = GraphLearner$new(gr)
gr = Graph$new()$
add_pipeop(po("learner", lrn("regr.svm")))$
add_pipeop(po("learner", lrn("regr.xgboost")))$
add_pipeop(po("learner", lrn("regr.ranger")))$
add_pipeop(po("learner", lrn("regr.lm")))$
add_pipeop(po("regravg"))$
add_edge("regr.svm", "regravg")$
add_edge("regr.xgboost", "regravg")$
add_edge("regr.ranger", "regravg")$
add_edge("regr.lm", "regravg")
plot(gr, html = TRUE)
learn_avg_lm = GraphLearner$new(gr)
learn_lm = lrn("regr.lm")
task = tsk("mtcars")
design = benchmark_grid(task, list(learn_avg, learn_avg_lm, learn_lm), rsmp("cv", folds = 5))
bm = benchmark(design)
The winner is…
bm$aggregate(msrs(c("regr.rmse", "regr.rmse_se")))[,c("learner_id", "regr.rmse", "regr.rmse_se")]
## learner_id regr.rmse regr.rmse_se
## 1: regr.svm.regr.xgboost.regr.ranger.regravg 5.657743 1.3362217
## 2: regr.svm.regr.xgboost.regr.ranger.regr.lm.regravg 4.461551 1.1802586
## 3: regr.lm 3.295919 0.7509652
External validation is the gold standard for evaluation: not only is the data completely unseen and new to the model, but the underlying data-generating process may even be different. If a model is trained and tested on exactly the same data, we get no indication of how it may perform on new data; in general, a model's training performance does not correlate with its testing performance. Moreover, without external validation there is no way to tell if your dataset is even representative, which can lead to strange results…
The TOT model (Tan, Osman, and Tan 1997): \(Y = 88.1 + 0.51X\) where \(Y\) is ear circumference (mm) and \(X\) is age (years).
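As a quick sanity check, the fitted line can be evaluated directly in R (a minimal sketch; the intercept and slope are the TOT coefficients quoted above):

```r
# TOT model: predicted ear circumference (mm) from age (years)
tot = function(age) 88.1 + 0.51 * age
tot(70)  # predicted ear circumference of a 70-year-old
## [1] 123.8
```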
Their conclusion:
“In any event, this knowledge may be useful for forensic scientists and can be an alternative to dental records in the future. It may be speculated that if Van Gogh’s ear had been fossilized, even if it were to be found 2000 years from now, one could always calculate the age at which he lost his ear from self mutilation!”
External validation (Mateen and Sonabend 2019):
The model was tested on 100 academics at The Alan Turing Institute, with the conclusion that always predicting someone's ear circumference as the global median was as good as the TOT model.
The problem:
Their model was fit on 100 residents and staff from a single Texan old-age home, which is not really representative of the world’s population…
Many models have been proposed to predict the probability of death or disease progression from COVID-19. A rigorously tested model with significant predictive power could be used in hospitals and by governments to inform policy. On the other hand, a model that only superficially performs well could result in incorrect and possibly harmful decisions being made.
A recent (pre-print!!!) systematic review (Gupta et al. 2020) found that of 22 models claimed to be predictive of COVID-19 mortality, none stood up to external validation and all were deemed unsuitable.
In the vast majority of experiments, external validation is simply not possible. This can be for a variety of reasons, but usually it comes down to a lack of data. Ultimately, researchers will (and should) prefer to use as much data as possible to fit a model. Furthermore, when data is collected for a specific experiment, there may genuinely be no external dataset to compare against until another study comes along and collects its own.
Luckily, procedures exist to estimate the results of external validation.
There are two types of error under discussion: training and testing error. Training error is calculated by evaluating predictions on the exact same data used for fitting; testing error is calculated by evaluating predictions on unseen data after fitting. Training error is (almost) always over-optimistic, i.e. lower than the true prediction error (Hastie, Tibshirani, and Friedman 2001).
task = tsk("mtcars")
learn = lrn("regr.lm")
errors = replicate(100, {
  train = sample(task$nrow, 2/3 * task$nrow)
  test = setdiff(seq(task$nrow), train)
  # training error: score predictions on the same rows used for fitting
  train_err = as.numeric(learn$train(task, row_ids = train)$predict(task, row_ids = train)$score())
  # testing error: score predictions on the held-out rows
  test_err = as.numeric(learn$train(task, row_ids = train)$predict(task, row_ids = test)$score())
  list(train = train_err, test = test_err)
})
boxplot(unlist(errors[2, ]), ylab = "Test MSE")
# training error on the full dataset, for comparison
abline(h = learn$train(task)$predict(task)$score(), col = 2)
True internal validation, i.e. where identical data is used for training and testing, is rarely used in practice; instead, resampling methods are utilised. The simplest resampling method we have already seen above: 'hold-out'. Hold-out randomly splits the data into a training set and a testing set, with the usual split being 2/3 of the data for training and 1/3 for testing (Hastie, Tibshirani, and Friedman 2001). Whilst this is a simple procedure and allows the majority of data to be used for model fitting, it provides only a single, high-variance estimate of the out-of-sample, or generalisation, error: the prediction error over an independent test sample. It is called the 'generalisation' error as it measures how well the model 'generalises', i.e. performs on datasets other than the one used for training.

\(K\)-fold cross-validation is a resampling procedure that effectively estimates this error. The procedure is relatively simple and is displayed algorithmically below: the dataset is split into \(K\) 'folds'; \(K-1\) are used for training and the \(K\)th is held out for testing. This is repeated \(K\) times so that every fold is used for testing at some stage. The predictions on the \(K\) folds are evaluated and aggregated into a single score.
Given a dataset \(d\):

1) Split \(d\) into \(K\) folds of (roughly) equal size.
2) For \(i = 1, \ldots, K\): fit the model on all folds except the \(i\)th, and predict on the held-out \(i\)th fold.
3) Evaluate the predictions on each fold and aggregate the \(K\) scores into one.
learn = lrn("regr.lm")
task = tsk("mtcars")
rr = resample(task, learn, rsmp("cv", folds = 5))
rr$score()[,c("learner_id", "iteration", "regr.mse")]
## learner_id iteration regr.mse
## 1: regr.lm 1 6.354374
## 2: regr.lm 2 43.089049
## 3: regr.lm 3 8.415706
## 4: regr.lm 4 12.899121
## 5: regr.lm 5 7.237533
rr$aggregate()
## regr.mse
## 15.59916
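The same procedure can also be written out by hand in base R, which makes the algorithm explicit (a sketch using lm directly on the built-in mtcars data; the exact score will differ from above because fold assignment is random):

```r
set.seed(42)
k = 5
# randomly assign each of the 32 rows to one of k folds
folds = sample(rep(1:k, length.out = nrow(mtcars)))
mse = numeric(k)
for (i in 1:k) {
  held_out = which(folds == i)                    # i-th fold for testing
  fit = lm(mpg ~ ., data = mtcars[-held_out, ])   # train on the other k-1 folds
  pred = predict(fit, newdata = mtcars[held_out, ])
  mse[i] = mean((mtcars$mpg[held_out] - pred)^2)  # per-fold MSE
}
mean(mse)  # aggregate the k per-fold scores into one estimate
```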
learn = lrn("regr.lm")
task = tsk("mtcars")
resamp = replicate(50, {
score = c()
for (i in c(5, 10)) {
rr = resample(task, learn, rsmp("cv", folds = i))
score = c(score, as.numeric(rr$aggregate()))
}
return(score)
})
loocv = as.numeric(resample(task, learn, rsmp("cv", folds = task$nrow))$aggregate())
plot(resamp[1,], type = "l", xlab = "Simulation", ylab = "MSE")
lines(resamp[2,], col = 2)
lines(rep(loocv, 50), col = 3)
legend("topright", lwd = 1, col = 1:3, legend = c("5-fold", "10-fold", "LOOCV (32 folds)"))
The example at the start comparing linear regression to ensemble methods demonstrates the need for benchmark experiments. Benchmark experiments compare the performance of multiple models in order to determine which is the ‘best’. Unfortunately, even using a suitable resampling method to estimate prediction error is not enough to determine if one model is ‘better’ than another. Determining if one model ‘outperforms’ another comes down to comparison of standard errors and significance testing.
In practice, formal comparisons between models are rarely reported, even in top journals (Király, Mateen, and Sonabend 2018). Yet these are vital to establishing whether one model outperforms another. Whilst allowances can perhaps be made for omitting significance tests, confidence intervals are simple to calculate…
In the simplest case, Normal approximations can be used to derive confidence intervals. As a standard caveat, when multiple experiments are performed, multiple testing correction should be applied to the constructed confidence intervals.
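As a sketch, given per-fold scores from a cross-validation, the Normal-approximation 95% interval is the mean plus or minus 1.96 standard errors (the scores below are made up for illustration):

```r
scores = c(3.1, 4.2, 2.8, 3.9, 3.5)     # hypothetical per-fold RMSEs from 5-fold CV
m = mean(scores)
se = sd(scores) / sqrt(length(scores))  # standard error of the mean
c(lower = m - 1.96 * se, upper = m + 1.96 * se)
## lower upper
## 3.000296 3.999704
```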
learns = lrns(paste0("regr.", c("lm", "rpart")))
task = tsk("mtcars")
design = benchmark_grid(task, learns, rsmp("cv", folds = 5))
res = benchmark(design)$aggregate(msrs(c("regr.rmse", "regr.rmse_se")))
plot(res$regr.rmse, ylim = c(2, 7), xaxt = "n", ylab = "RMSE", xlab = "")
arrows(1:2, y0 = res$regr.rmse - (res$regr.rmse_se * 1.96),
y1 = res$regr.rmse + (res$regr.rmse_se * 1.96), code = 3, angle = 90)
A slightly more rigorous method is to perform hypothesis testing on the results. One option is a Wilcoxon signed-rank test comparing the residuals of two models across all folds. Effectively, this tests whether the residuals produced by one model show a significantly lower location shift than those of another.
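A minimal base-R sketch of such a test, using simulated (entirely hypothetical) squared residuals for two models evaluated on the same 20 test points, paired by observation:

```r
set.seed(1)
res_a = rchisq(20, df = 2)                # squared residuals, model A
res_b = res_a + abs(rnorm(20, mean = 1))  # model B: systematically larger errors
# one-sided paired test: is model A's error distribution shifted below model B's?
p = wilcox.test(res_a, res_b, paired = TRUE, alternative = "less")$p.value
p  # very small here, as model B's residuals are larger on every pair
```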
learners <- list(makeLearner("regr.lm", id = "lm"),
makeLearner("regr.randomForest", id = "RF"),
makeLearner("regr.nnet", id = "NN"),
makeLearner("regr.featureless"))
bmresults <- benchmark(learners, bh.task, cv5, list(rmse, rmse.se, mae, mae.se))
test <- comparisontest(bmresults)[[1]][[1]]
matrix(p.adjust(test, method = "BH"), nrow = 4, ncol = 4, dimnames = dimnames(test))
## RF lm NN regr.featureless
## RF 1 3.68824e-05 1.477089e-10 1.477089e-10
## lm 1 1.00000e+00 4.676359e-07 4.676359e-07
## NN 1 1.00000e+00 1.000000e+00 1.000000e+00
## regr.featureless 1 1.00000e+00 9.821731e-02 1.000000e+00
A final complication is presented by model tuning. Tuning is the process of taking a set of possible configurations for model hyper-parameters, trying each configuration by fitting and predicting the model on data, and selecting the configuration with the best performance.
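As a toy sketch of this process, consider a grid search over a single hyper-parameter, here the degree of a polynomial regression on simulated data, scored with a simple hold-out split (the data and candidate grid are purely illustrative):

```r
set.seed(7)
x = runif(200)
y = sin(2 * pi * x) + rnorm(200, sd = 0.3)
dat = data.frame(x, y)
train = sample(200, 140)
test = setdiff(1:200, train)

# try each candidate configuration and score it on the hold-out set
degrees = 1:6
mse = sapply(degrees, function(d) {
  fit = lm(y ~ poly(x, d), data = dat[train, ])
  mean((dat$y[test] - predict(fit, dat[test, ]))^2)
})
degrees[which.min(mse)]  # select the best-performing configuration
```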
1) The wrong way
2) Still wrong
3) The correct method
Note:
Ideally three datasets would be used: one for training, one for tuning, and one for final testing. In practice this is rarely possible, so nested cross-validation serves as a useful approximation.
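A hand-rolled sketch of nested cross-validation in base R, on the same kind of toy problem as before: the inner 3-fold loop tunes a polynomial degree using only the outer-training data, and the outer 5-fold loop estimates the tuned model's generalisation error:

```r
set.seed(11)
x = runif(300)
y = sin(2 * pi * x) + rnorm(300, sd = 0.3)
dat = data.frame(x, y)
outer = sample(rep(1:5, length.out = 300))  # outer fold assignment

outer_mse = sapply(1:5, function(i) {
  train = dat[outer != i, ]
  test = dat[outer == i, ]
  # inner 3-fold CV on the outer-training data only: tune the degree
  inner = sample(rep(1:3, length.out = nrow(train)))
  inner_mse = sapply(1:6, function(d) {
    mean(sapply(1:3, function(j) {
      fit = lm(y ~ poly(x, d), data = train[inner != j, ])
      mean((train$y[inner == j] - predict(fit, train[inner == j, ]))^2)
    }))
  })
  best = which.min(inner_mse)
  # refit with the selected degree on all outer-training data, score on the outer fold
  fit = lm(y ~ poly(x, best), data = train)
  mean((test$y - predict(fit, test))^2)
})
mean(outer_mse)  # estimate of the tuned model's generalisation error
```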
task = tsk("mtcars")
instance = TuningInstanceSingleCrit$new(
task = task,
learner = lrn("regr.svm", type = "nu-regression"),
resampling = rsmp("cv", folds = 3),
measure = msr("regr.rmse"),
search_space = ParamSet$new(list(ParamDbl$new("nu", lower = 0, upper = 1))),
terminator = trm("evals", n_evals = 10)
)
tt = tnr("random_search")
tt$optimize(instance)
instance$result
## nu learner_param_vals x_domain regr.rmse
## 1: 0.4679156 <list[2]> <list[1]> 3.351495
svm_tuned = AutoTuner$new(
learner = lrn("regr.svm", type = "nu-regression"),
resampling = rsmp("cv", folds = 3),
measure = msr("regr.rmse"),
search_space = ParamSet$new(list(ParamDbl$new("nu", lower = 0, upper = 1))),
terminator = trm("evals", n_evals = 60),
tuner = tnr("random_search")
)
lm = lrns(c("regr.lm", "regr.featureless"))
bm = benchmark(benchmark_grid(task, c(list(svm_tuned), lm), rsmp("cv", folds = 5)))
bm$aggregate(msrs(c("regr.rmse", "regr.rmse_se")))[,c("learner_id", "regr.rmse", "regr.rmse_se")]
## learner_id regr.rmse regr.rmse_se
## 1: regr.svm.tuned 3.479344 0.8513176
## 2: regr.lm 3.150791 0.7443326
## 3: regr.featureless 5.940268 1.5540413
Binder, Martin, Florian Pfisterer, Bernd Bischl, Michel Lang, and Susanne Dandl. 2019. “mlr3pipelines: Preprocessing Operators and Pipelines for ’mlr3’.” CRAN.
Bischl, Bernd, Michel Lang, Lars Kotthoff, Julia Schiffner, Jakob Richter, Erich Studerus, Giuseppe Casalicchio, and Zachary M Jones. 2016. “mlr: Machine Learning in R.” Journal of Machine Learning Research 17 (170): 1–5. http://jmlr.org/papers/v17/15-066.html.
Box, George E. P. 1976. “Science and Statistics.” Journal of the American Statistical Association 71 (356): 791–799. http://links.jstor.org/sici?sici=0162-1459%28197612%2971%3A356%3C791%3ASAS%3E2.0.CO%3B2-W.
Gupta, Rishi K, Michael Marks, Thomas H. A. Samuels, Akish Luintel, Tommy Rampling, Humayra Chowdhury, Matteo Quartagno, et al. 2020. “Systematic Evaluation and External Validation of 22 Prognostic Models Among Hospitalised Adults with Covid-19: An Observational Cohort Study.” medRxiv. Cold Spring Harbor Laboratory Press. https://doi.org/10.1101/2020.07.24.20149815.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2001. The Elements of Statistical Learning. Springer New York Inc.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An introduction to statistical learning. Vol. 112. New York: Springer.
Király, Franz J, Bilal Mateen, and Raphael Sonabend. 2018. “NIPS - Not Even Wrong? A Systematic Review of Empirically Complete Demonstrations of Algorithmic Effectiveness in the Machine Learning and Artificial Intelligence Literature,” December. http://arxiv.org/abs/1812.07519.
Lang, Michel, Quay Au, Stefan Coors, and Patrick Schratz. 2020. “mlr3learners: Recommended Learners for ’mlr3’.” CRAN.
Lang, Michel, Martin Binder, Jakob Richter, Patrick Schratz, Florian Pfisterer, Stefan Coors, Quay Au, Giuseppe Casalicchio, Lars Kotthoff, and Bernd Bischl. 2019. “mlr3: A modern object-oriented machine learning framework in R.” Journal of Open Source Software 4 (44): 1903. https://doi.org/10.21105/joss.01903.
Lang, Michel, Bernd Bischl, Jakob Richter, Xudong Sun, and Martin Binder. 2019. “paradox: Define and Work with Parameter Spaces for Complex Algorithms.” CRAN. https://cran.r-project.org/package=paradox.
Lang, Michel, Jakob Richter, Bernd Bischl, and Daniel Schalk. 2019. “mlr3tuning: Tuning for ’mlr3’.” CRAN. https://cran.r-project.org/package=mlr3tuning.
Mateen, Bilal, and Raphael Sonabend. 2019. “All I want for Christmas is…Rigorous validation of predictive models to prevent hasty generalisations.” Significance 16 (6). John Wiley & Sons, Ltd (10.1111): 20–24. https://doi.org/10.1111/j.1740-9713.2019.01336.x.
Polley, Eric C, and Mark J Van Der Laan. 2010. “Super learner in prediction.” bepress.
Tan, R, V Osman, and G Tan. 1997. “Ear Size as a Predictor of Chronological Age.” Archives of Gerontology and Geriatrics 25 (2). Elsevier: 187–91.